Abstract

Non-convex optimization is ubiquitous in machine learning. The Majorization-Minimization
(MM) procedure systematically optimizes non-convex functions through an iterative
construction and optimization of upper bounds on the objective function. The bound at each
iteration is required to touch the objective function at the optimizer of the previous bound.
We show that this touching constraint is unnecessary and overly restrictive. We generalize
MM by relaxing this constraint, and propose a new optimization framework, named
Generalized Majorization-Minimization (G-MM), that is more flexible than MM.
For instance, it can incorporate application-specific biases into the optimization procedure
without changing the objective function. We derive G-MM algorithms for several latent
variable models and show empirically that they consistently outperform their MM
counterparts in optimizing non-convex objectives. In particular, G-MM algorithms appear to
be less sensitive to initialization.

Keywords: majorization-minimization, non-convex optimization, latent variable models, 
expectation maximization 

1. Introduction 

Non-convex optimization is ubiquitous in machine learning. For example, data clustering
(MacQueen, 1967; Arthur and Vassilvitskii, 2007), training classifiers with latent variables
(Yu and Joachims, 2009; Felzenszwalb et al., 2010; Pirsiavash and Ramanan, 2014; Azizpour
et al., 2015), and training visual object detectors from weakly labeled data (Song et al., 2014;
Rastegari et al., 2015; Ries et al., 2015) all lead to non-convex optimization problems.

Majorization-Minimization (MM) (Hunter et al., 2000) is an optimization framework
for designing well-behaved optimization algorithms for non-convex functions. MM
algorithms work by iteratively optimizing a sequence of easy-to-optimize surrogate functions
that bound the objective. Two of the most successful instances of MM algorithms are
Expectation-Maximization (EM) (Dempster et al., 1977) and the Concave-Convex
Procedure (CCCP) (Yuille and Rangarajan, 2003). However, both have a number of drawbacks
in practice, such as sensitivity to initialization and lack of uncertainty modeling for latent
variables. This has been noted in works such as (Neal and Hinton, 1998; Felzenszwalb et al.,
2010; Parizi et al., 2012; Kumar et al., 2012; Ping et al., 2014).

We propose a new procedure, Generalized Majorization-Minimization (G-MM), for
optimizing non-convex objective functions. Our approach is inspired by MM, but we generalize
the bound construction process. Specifically, we relax the strict bound construction method
in MM to allow for a set of valid bounds to be used, while still maintaining algorithmic
convergence. This relaxation gives us more freedom in bound selection and can be used to
design better optimization algorithms.

In training latent variable models and in clustering problems, MM algorithms such as
CCCP and k-means are known to be sensitive to the initial values of the latent variables
or cluster memberships. We refer to this problem as stickiness of the algorithm to the
initial latent values. Our experimental results show that G-MM leads to methods that tend
to be less sticky to initialization. We demonstrate the benefit of using G-MM on multiple
problems, including k-means clustering and applications of Latent Structural SVMs to image
classification with latent variables.

1.1 Related Work 

Perhaps the most famous iterative algorithm for non-convex optimization in machine
learning and statistics is the EM algorithm (Dempster et al., 1977). EM is best understood in
the context of maximum likelihood estimation in the presence of missing data, or latent 
variables. EM is a bound optimization algorithm: in each E-step, a lower bound on the 
likelihood is constructed, and the M-step maximizes this bound. 

Countless efforts have been made to extend the EM algorithm since its introduction. 
In (Neal and Hinton, 1998) it is shown that, while both steps in EM involve optimizing some 
functions, it is not necessary to fully optimize the functions in each step; in fact, each step 
only needs to “make progress”. This relaxation can potentially avoid sharp local minima 
and even speed up convergence. 

In another attempt, Hunter et al. (2000) proposed the Majorization-Minimization (MM)
framework. MM generalizes methods like EM by “transferring” the optimization to a
sequence of surrogate functions (bounds) on the original objective function. The
Concave-Convex Procedure (CCCP) (Yuille and Rangarajan, 2003) is another widely-used instance
of MM, where the surrogate function is obtained by linearizing the concave part of the
objective function. Many successful machine learning algorithms employ CCCP, e.g. the
Latent SVM (Felzenszwalb et al., 2010). Other instances of MM algorithms include
iterative scaling (Pietra et al., 1997), and non-negative matrix factorization (Lee and Seung,
1999). Another related line of research concerns Difference-of-Convex (DC)
programming (Tao, 1997), which can be shown to reduce to CCCP under certain conditions.
Convergence properties of such general “bound optimization” algorithms have been discussed
by Salakhutdinov et al. (2002).

Despite widespread success, MM (and CCCP in particular) has a number of drawbacks,
some of which have motivated our work. In practice, CCCP often exhibits stickiness to
initialization, which necessitates expensive initialization or multiple trials (Parizi et al., 2012;
Figure 1: Optimization of a function $F$ using MM (red) and G-MM (blue).
In MM the bound $b_2$ has to touch $F$ at $w_1$. In G-MM we only require
that $b_2$ be below $b_1$ at $w_1$, leading to several choices for $b_2$.


Algorithm 1: G-MM optimization

Input: $w_0$, $\eta$, $\epsilon$
1: $v_0 := F(w_0)$
2: for $t := 1, 2, \dots$ do
3:     select $b_t \in \mathcal{B}_t = \mathcal{B}(w_{t-1}, v_{t-1})$ as in (4)
4:     $w_t := \arg\min_w b_t(w)$
5:     $d_t := b_t(w_t) - F(w_t)$
6:     $v_t := b_t(w_t) - \eta d_t$
7:     if $d_t < \epsilon$ break
8: end for
Output: $w_t$


Song et al., 2014; Cinbis et al., 2016). In optimizing latent variable models, CCCP lacks
the ability to incorporate application-specific information, such as prior knowledge or side
information (Xing et al., 2002; Yu, 2012), latent variable uncertainty (Kumar et al., 2012;
Ping et al., 2014), and posterior regularization (Ganchev et al., 2010), without modifying
the objective function. Our framework can deal with these drawbacks. Our key
observation is that we can relax the constraint enforced by MM that requires the bounds
to touch the objective function, and this relaxation gives us the ability to better avoid
sensitivity to initialization, and to incorporate side information.

A closely related work to ours is the pseudo-bound optimization framework by Tang
et al. (2014), which generalizes CCCP by using “pseudo-bounds” that may intersect the
objective function. In contrast, our framework still uses valid bounds, but only relaxes the
touching requirement. Also, the pseudo-bound optimization framework is not as general
as ours in that it is designed specifically for optimizing binary energies in MRFs, and it
restricts the form of surrogate functions to parametric max-flow.

The generalized variants of EM proposed and analyzed by Neal and Hinton (1998)
and Gunawardana and Byrne (2005) are related to our work when we restrict our attention
to probabilistic models and the EM algorithm. EM can be viewed as a bound optimization
procedure where the likelihood function involves both the model parameters $\theta$ and a
distribution $q$ over the latent variables, denoted by $F(\theta, q)$. Choosing $q$ to be the posterior
leads to a lower bound on $F$ that is tight at the current estimate of $\theta$. Generalized versions
of EM, such as those given by Neal and Hinton (1998), use distributions other than the
posterior in an alternating optimization of $F$. This fits into our framework, as we use the
exact same objective function, and only change the bound construction step (which amounts
to picking the distribution $q$ in EM). We propose both stochastic and deterministic strategies
for bound construction, and demonstrate that they lead to higher quality solutions and less
sensitivity to initialization than other EM-like methods.

2. Proposed Optimization Framework 

We consider minimization of functions that are bounded from below. The extension to
maximization is trivial. Let $F(w) : \mathbb{R}^d \to \mathbb{R}$ be a lower-bounded function that we wish
to minimize. We propose an iterative procedure that generates a sequence of solutions
$w_0, w_1, \dots$ until it converges. The solution at iteration $t \ge 1$ is obtained by minimizing
an upper bound $b_t(w)$ to the objective function, i.e. $w_t = \arg\min_w b_t(w)$. The bound at
iteration $t$ is chosen from a set of “valid” bounds $\mathcal{B}_t$ (see Figure 1). In practice, we assume
that the members of $\mathcal{B}_t$ come from a family $\mathcal{F}$ of functions that upper bound $F$ and can
be optimized efficiently, such as quadratic functions, or quadratic functions with linear
constraints. Algorithm 1 gives the outline of the approach.

This general scheme is used in both MM and G-MM. However, as we shall see in the 
rest of this section, MM and G-MM have key differences in the way they measure progress 
and the way they construct new bounds. 

2.1 Progress Measure 

MM measures progress with respect to the objective values. To guarantee progress over
time MM requires that the bound at iteration $t$ must touch the objective function at the
previous solution, leading to the following constraint:

MM constraint: $b_t(w_{t-1}) = F(w_{t-1})$.   (1)

This touching constraint, together with the fact that $w_t$ minimizes $b_t$, leads to
$F(w_t) \le F(w_{t-1})$. That is, the value of the objective function is non-increasing over time.
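
The monotonicity claim can be spelled out as a three-step chain. Each step uses only facts stated above: $b_t$ is an upper bound on $F$, $w_t$ minimizes $b_t$, and the touching constraint (1) holds.

```latex
\begin{align*}
F(w_t) &\le b_t(w_t)     && \text{($b_t$ is an upper bound on $F$)} \\
       &\le b_t(w_{t-1}) && \text{($w_t$ minimizes $b_t$)} \\
       &=   F(w_{t-1})   && \text{(touching constraint (1))}
\end{align*}
```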

The touching requirement is restrictive and, in practice, can make it hard to avoid
local minima. In particular, this can make MM algorithms such as CCCP sensitive to
initialization (Parizi et al., 2012; Cinbis et al., 2016; Song et al., 2014). The touching
constraint also eliminates the possibility of using bounds that do not touch the objective
function but may have other desirable properties.

In G-MM, we measure progress with respect to the bound values. This allows us to relax
the touching constraint of MM, stated in (1), and require instead that

G-MM constraints: $b_1(w_0) = F(w_0)$, and $b_t(w_{t-1}) \le b_{t-1}(w_{t-1})$ for $t \ge 2$.   (2)

Note that the G-MM constraints are weaker than MM. In fact, (1) implies (2) because $b_{t-1}$
is an upper bound on $F$.

The weaker constraints used in G-MM do not imply $F(w_t) \le F(w_{t-1})$, but are sufficient
to ensure that $\forall t,\ F(w_t) \le F(w_0)$. This follows from the fact that $b_t$ is an upper bound of
$F$, $w_t$ is a minimizer of $b_t(w)$, and (2):

$F(w_t) \le b_t(w_t) \le b_t(w_{t-1}) \le \cdots \le b_1(w_1) \le b_1(w_0) = F(w_0)$.   (3)


2.2 Bound Construction 

This section describes step 3 of Algorithm 1. To construct new bounds, G-MM considers a
“valid” subset of the upper bounds in $\mathcal{F}$ that satisfy (2). We denote the set of valid bounds
at iteration $t$ by $\mathcal{B}_t$. We consider two scenarios for constructing bounds in G-MM. The first
scenario is when we have a bias function $g : \mathcal{B}_t \times \mathbb{R}^d \to \mathbb{R}$ over the set of valid bounds.
The function $g$ takes in a bound $b \in \mathcal{B}_t$ and a current solution $w \in \mathbb{R}^d$ and returns a scalar
indicating the goodness of the bound. In this scenario we select the bound with the largest
bias value, i.e. $b_t = \arg\max_{b \in \mathcal{B}_t} g(b, w_{t-1})$.

In the second scenario we propose to choose one of the valid bounds from $\mathcal{B}_t$ at random.
Thus, we have both a deterministic (the first scenario) and a stochastic (the second scenario)
bound construction mechanism. MM algorithms, such as EM and CCCP, fall into the first
scenario.

In general, MM algorithms are instances of G-MM that use a specific bias function
$g(b, w) = -b(w)$. The bound that touches $F$ at $w$ maximizes this bias function. It also
maximizes the amount of progress with respect to the previous bound at $w$. By choosing
bounds that maximize progress, MM algorithms tend to rapidly converge to a nearby local
minimum. For instance, at iteration $t$, the CCCP bound for latent SVMs is obtained by
fixing latent variables in the concave part of the objective function to the best values
according to the previous solution $w_{t-1}$, making the new solution $w_t = \arg\min_w b_t(w)$ attracted
to $w_{t-1}$. Similarly, in the E-step, EM sets the posterior distribution of the latent variables
conditioned on the data according to the model from the previous iteration. Thus, in the
next maximization step, the model is updated to “match” the fixed posterior distributions.
This explains one reason why MM algorithms are observed to be sticky to initialization.

G-MM offers a more flexible bound construction scheme than MM. In Section 5 we show 
empirically that picking bounds uniformly at random from the set of valid bounds is less 
sensitive to initialization and leads to better results compared to CCCP and EM. We also 
show that using good bias functions can further improve performance of the learned models. 

To guarantee convergence, we restrict the set of valid bounds $\mathcal{B}_t$ to those that are below
a threshold $v_{t-1}$ at the previous solution $w_{t-1}$:

$\mathcal{B}_t = \mathcal{B}(w_{t-1}, v_{t-1})$,
$\mathcal{B}(w, v) = \{ b \in \mathcal{F} \mid b(w) \le v \}$.   (4)

Initially, we set $v_0 = F(w_0)$ to ensure that the first bound touches $F$. In the next
iterations, we set $v_t = b_t(w_t) - \eta d_t$ using a progress coefficient $\eta \in (0, 1]$ that ensures
making at least $\eta d_t$ progress, where $d_t = b_t(w_t) - F(w_t)$ is the gap between the bound and
the true objective value at $w_t$.

Note that when $\eta = 1$ all valid bounds touch $F$ at $w_{t-1}$ (see (4)), corresponding to the
MM requirement. Small $\eta$ values allow for gradual exploratory progress in each step while
large $\eta$ values greedily select bounds that guarantee immediate progress.
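
To make the interplay between the threshold $v_t$, the gap $d_t$, and the valid set $\mathcal{B}(w, v)$ concrete, the following is a minimal Python sketch of Algorithm 1 with random bound selection. It is an illustration under stated assumptions, not the authors' implementation: the helper `valid_bounds`, the toy objective, and all numeric values are hypothetical. The toy objective is a minimum over a few one-dimensional quadratics, so fixing the index of the quadratic yields an upper bound with a closed-form minimizer.

```python
import random


def g_mm(F, valid_bounds, w0, eta=0.5, eps=1e-9, max_iter=1000, seed=0):
    """Sketch of Algorithm 1 with random bound selection.

    valid_bounds(w, v) is a hypothetical helper returning a list of
    (bound_fn, minimizer) pairs with bound_fn(w) <= v, i.e. the set B(w, v).
    """
    rng = random.Random(seed)
    w, v = w0, F(w0)                                  # line 1: v_0 := F(w_0)
    for _ in range(max_iter):                         # line 2
        b, w_next = rng.choice(valid_bounds(w, v))    # line 3: pick b_t in B_t
        w = w_next                                    # line 4: w_t := argmin_w b_t(w)
        d = b(w) - F(w)                               # line 5: gap d_t
        v = b(w) - eta * d                            # line 6: threshold v_t
        if d < eps:                                   # line 7
            break
    return w


# Toy 1-D objective (illustrative values): F(w) = min_z (w - c_z)^2 + h_z.
# Fixing z gives a quadratic upper bound on F whose minimizer is c_z.
PIECES = [(-2.0, 3.0), (-1.0, 1.0), (3.0, 0.0)]       # (c_z, h_z)


def make_bound(c, h):
    return lambda w: (w - c) ** 2 + h


BOUNDS = [(make_bound(c, h), c) for c, h in PIECES]


def F(w):
    return min(b(w) for b, _ in BOUNDS)


def valid_bounds(w, v, tol=1e-12):
    # tol guards against floating-point ties at the touching bound
    return [(b, c) for b, c in BOUNDS if b(w) <= v + tol]


w_final = g_mm(F, valid_bounds, w0=-2.5)
```

In this sketch the valid set is never empty, because the bound that touches $F$ at the current solution always satisfies the threshold when $\eta \le 1$. The returned solution satisfies $F(w_t) \le F(w_0)$, in line with (3).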

3. Convergence of G-MM 

We showed in (3) that, in any iteration $t$, the G-MM solution is no worse than the initial
solution in terms of the objective value, i.e. $F(w_t) \le F(w_0)$. In this section we also show
that the gap between the G-MM bounds and the true objective converges to zero at the
minimizers of the bounds (Theorem 1). Moreover, we show that, under mild conditions,
G-MM converges to a solution (Theorem 2) that is a local extremum or a saddle point of
$F$ (Theorem 3). Other convergence properties of G-MM depend on the structure of the
objective function $F$, the family of the bounds $\mathcal{F}$, and the details of the bound selection
strategy.
Theorem 1. In the limit as $t \to \infty$, G-MM bounds and $F$ touch at the minimizer of the
bounds, i.e. $\lim_{t \to \infty} d_t = 0$, where $d_t = b_t(w_t) - F(w_t)$.

Proof. We have $b_t(w_t) \le b_t(w_{t-1}) \le v_{t-1}$, where $v_t$ is defined as in line 6 of Algorithm 1.
The first inequality holds because $w_t$ minimizes $b_t$ and the second inequality follows from
(4). Summing over $t$ gives

$\sum_{t=1}^{T} b_t(w_t) \le \sum_{t=1}^{T} v_{t-1} = v_0 + \sum_{t=1}^{T-1} \big( b_t(w_t) - \eta d_t \big)$   (5)

$\Rightarrow \eta \sum_{t=1}^{T-1} d_t \le v_0 - b_T(w_T) = F(w_0) - b_T(w_T)$.   (6)

Let $w^* \in \arg\min_w F(w)$ be an unknown global minimum of $F$. Using $F(w^*) \le F(w_T) \le
b_T(w_T)$ and (6), we have

$\eta \sum_{t=1}^{T-1} d_t \le F(w_0) - F(w^*) \Rightarrow \lim_{t \to \infty} d_t = 0$.   (7)

The implication holds because every $d_t$ is non-negative and the right-hand side does not
depend on $T$. If a global minimizer of $F$ does not exist, we can use the assumption that $F$ is
bounded from below and replace $F(w^*)$ in (7) with the value of the lower bound. □

Theorem 2. If $\mathcal{F}$ is a family of $m$-strongly convex functions where $m > 0$, then $\exists w^\dagger$ s.t.
$\lim_{t \to \infty} w_t = w^\dagger$, i.e. G-MM converges to a solution.

Proof. Let $f$ be an $m$-strongly convex function, let $x, y$ be two points in the domain of
$f$, and let $g \in \partial f(x)$ be a subgradient of $f$ at $x$. We have

$f(y) \ge f(x) + g^\top (y - x) + \frac{m}{2} \|x - y\|^2$.   (8)

After substituting $f = b_t$, $x = w_t$, and $y = w_{t-1}$ in (8), and noting that the zero vector is a
subgradient of $b_t$ at $w_t$, i.e. $0 \in \partial b_t(w_t)$, we get:

$\frac{2}{m} \big( b_t(w_{t-1}) - b_t(w_t) \big) \ge \|w_t - w_{t-1}\|^2$.   (9)

We can replace $b_t(w_{t-1})$ by $v_{t-1}$ in (9) due to (4). By adding the inequalities for $T$ iterations,
similar to what we did in the proof of Theorem 1, we get:

$\frac{2}{m} \big( F(w_0) - F(w^*) \big) \ge \sum_{t=1}^{T} \|w_t - w_{t-1}\|^2$   (10)

$\Rightarrow \exists w^\dagger$ s.t. $\lim_{t \to \infty} w_t = w^\dagger$.   (11)

Similar to the proof of Theorem 1, if a global minimum does not exist, we can replace $F(w^*)$
with the value of the lower bound of $F$ in (10). □

Theorem 3. Let $F$ be continuously differentiable, and $\mathcal{F}$ be a family of strongly convex and
smooth functions. Further, if $\exists\, \infty > M \ge m > 0$ such that $\forall b \in \mathcal{F}$, $M I \succeq \nabla^2 b(w) \succeq m I$,
where $I$ is the identity matrix, then $\nabla F(w^\dagger) = 0$; that is, G-MM converges to a local
extremum or saddle point of $F$.

We postpone the proof of Theorem 3 to the appendix. 
4. Derived Optimization Algorithms

For simplicity and ease of exposition, we primarily focus on demonstrating G-MM on
latent variable models where bound construction naturally corresponds to imputing latent
variables in the model. In this section we derive G-MM algorithms for two models widely
used in machine learning, namely, k-means clustering and Latent Structural SVM, both
belonging to the category of latent variable models. It is worth mentioning that G-MM is
fully capable of handling more general non-convex problems.


4.1 k-means Clustering

Let $\{x_1, \dots, x_n\}$ denote a set of sample points and $w = (\mu_1, \dots, \mu_k)$ denote a set of cluster
centers. We use $z_i \in \{1, \dots, k\}$ to denote the index of the cluster assigned to the $i$-th sample
point. The objective function in k-means clustering is defined as follows,

$F(w) = \sum_{i=1}^{n} \min_{j=1}^{k} \|x_i - \mu_j\|^2, \quad w = (\mu_1, \dots, \mu_k)$.   (12)


Bound construction: We obtain a convex upper bound on $F$ by fixing the latent variables
$(z_1, \dots, z_n)$ to certain values instead of minimizing over these variables. Bounds constructed
this way are quadratic convex functions of $w = (\mu_1, \dots, \mu_k)$,

$\mathcal{F} = \Big\{ \sum_{i=1}^{n} \|x_i - \mu_{z_i}\|^2 \;\Big|\; \forall i,\ z_i \in \{1, \dots, k\} \Big\}$.   (13)


The k-means algorithm is an instance of MM methods. The algorithm repeatedly assigns
each example to its nearest center to construct a bound, and then updates the centers
by optimizing the bound. We can set $g(b, w) = -b(w)$ in G-MM to obtain the k-means
algorithm. We can also define $g$ differently to obtain a G-MM algorithm that exhibits other
desired properties. For instance, a common issue in clustering is cluster starvation. One
can design a bias function that encourages balanced clusters by selecting $g$ appropriately.

We can select a random bound from $\mathcal{B}_t$ by sampling a latent configuration $z = (z_1, \dots, z_n)$
uniformly from the set of configurations leading to valid bounds. Specifically, we start from
a valid initial configuration (e.g. the k-means solution) and do a random walk on a graph whose
nodes are latent configurations defining valid bounds. The neighbors of a latent
configuration $z$ are the latent configurations that can be obtained by changing the value of one of
the $n$ latent variables in $z$.
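
The random walk described above can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function names, the fixed number of steps, and the uniform single-variable proposal are assumptions, and no claim is made here about how closely a finite walk approximates uniform sampling. A proposed move is accepted only if the resulting bound stays below the threshold `v`, so the walk never leaves the set of valid configurations.

```python
import random


def bound_value(points, centers, z):
    """Value b(w) of the bound indexed by assignment z at centers w (see (13))."""
    return sum(
        sum((xd - cd) ** 2 for xd, cd in zip(x, centers[zi]))
        for x, zi in zip(points, z)
    )


def random_valid_assignment(points, centers, z_start, v, steps=100, seed=0):
    """Random walk over assignments whose bound value at `centers` is <= v."""
    if bound_value(points, centers, z_start) > v:
        raise ValueError("z_start must define a valid bound")
    rng = random.Random(seed)
    z = list(z_start)
    k = len(centers)
    for _ in range(steps):
        i = rng.randrange(len(points))      # pick one latent variable
        old = z[i]
        z[i] = rng.randrange(k)             # propose a neighboring configuration
        if bound_value(points, centers, z) > v:
            z[i] = old                      # reject moves that leave the valid set
    return z


# Illustrative data: two well-separated pairs of points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers = [(0.5, 0.0), (10.5, 0.0)]
z_nearest = [0, 0, 1, 1]                    # nearest-center assignment
z_loose = random_valid_assignment(points, centers, z_nearest, v=101.0, steps=50)
z_tight = random_valid_assignment(points, centers, z_nearest, v=1.0, steps=50)
```

Recomputing the full bound for every proposal keeps the sketch short; only the term of the changed point actually needs updating.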

Bound optimization: Optimization of a bound $b \in \mathcal{F}$ can be done in closed form by
setting $\mu_j$ to be the mean of all examples assigned to cluster $j$:

$\mu_j = \frac{\sum_{i \in I_j} x_i}{|I_j|}, \quad I_j = \{1 \le i \le n \mid z_i = j\}$.   (14)
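
The closed-form update in (14) can be sketched in a few lines of Python. The function name is hypothetical. The handling of empty clusters is a choice made for this sketch: (14) is undefined when no point is assigned to a cluster, and placing such a center at the origin mirrors the behavior reported for dead clusters in Section 5.1.

```python
def update_centers(points, z, k):
    """Set each center to the mean of the points assigned to it, as in (14)."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for x, zi in zip(points, z):
        counts[zi] += 1
        for d in range(dim):
            sums[zi][d] += x[d]
    centers = []
    for j in range(k):
        if counts[j] == 0:
            # Empty ("dead") cluster: place the center at the origin.
            centers.append(tuple(0.0 for _ in range(dim)))
        else:
            centers.append(tuple(s / counts[j] for s in sums[j]))
    return centers


new_centers = update_centers([(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)], [0, 0, 1], k=3)
```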


4.2 Latent Structural SVM 

A Latent Structural SVM (LS-SVM) (Yu and Joachims, 2009) defines a structured output
classifier with latent variables. It extends the Structural SVM (Joachims et al., 2009) by
introducing latent variables.
Let $\{(x_1, y_1), \dots, (x_n, y_n)\}$ denote a set of labeled examples with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$.
We assume that each example $x_i$ has an associated latent value $z_i \in \mathcal{Z}$. Let $\phi(x, y, z) :
\mathcal{X} \times \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}^d$ denote a feature map. A vector $w \in \mathbb{R}^d$ defines a classifier $y : \mathcal{X} \to \mathcal{Y}$,

$y(x) = \arg\max_{y} \big( \max_{z} w \cdot \phi(x, y, z) \big)$.   (15)

The LS-SVM training objective is defined as follows,

$F(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \Big( \max_{y, z} \big( w \cdot \phi(x_i, y, z) + \Delta(y, y_i) \big) - \max_{z} w \cdot \phi(x_i, y_i, z) \Big)$,   (16)

where $\lambda$ is a hyper-parameter that controls regularization and $\Delta(y, y_i)$ is a non-negative
loss function that penalizes the prediction $y$ when the ground truth label is $y_i$.
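
For small, finite label and latent sets, the objective in (16) can be evaluated by enumeration. The sketch below is a hedged illustration: the feature map `phi`, the 0-1 loss `zero_one`, the two-class block layout, and the data are all hypothetical choices made for this example and are not taken from the experiments.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ls_svm_objective(w, data, labels, latents, phi, delta, lam):
    """Evaluate (16) by brute-force enumeration over labels and latent values."""
    reg = 0.5 * lam * dot(w, w)
    total = 0.0
    for x, y_true in data:
        # Loss-augmented maximization over (y, z)
        augmented = max(
            dot(w, phi(x, y, z)) + delta(y, y_true)
            for y in labels for z in latents
        )
        # Best latent value for the ground-truth label
        fit = max(dot(w, phi(x, y_true, z)) for z in latents)
        total += augmented - fit
    return reg + total / len(data)


def phi(x, y, z):
    # Toy feature map: x holds one 2-D feature per latent location;
    # the feature at location z is placed in the block of class y (2 classes).
    feat = [0.0] * 4
    feat[2 * y], feat[2 * y + 1] = x[z]
    return feat


def zero_one(y, y_true):
    return 0.0 if y == y_true else 1.0


data = [([(1.0, 0.0), (0.0, 1.0)], 0), ([(0.0, 2.0), (1.0, 1.0)], 1)]
obj_zero = ls_svm_objective([0.0] * 4, data, [0, 1], [0, 1], phi, zero_one, lam=0.4)
obj_w = ls_svm_objective([1.0, 0.0, 0.0, 1.0], data, [0, 1], [0, 1], phi, zero_one, lam=0.4)
```

With $w = 0$ every score vanishes, so the objective equals the largest loss value. Each summand is non-negative because the loss-augmented maximum includes the ground-truth label with zero loss.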

Bound construction: As in the case of k-means, a convex upper bound on the LS-SVM
objective can be obtained by imputing latent variables. Specifically, for each example
$x_i$, we fix $z_i \in \mathcal{Z}$, and replace the maximization in the last term of the objective with a
linear function $w \cdot \phi(x_i, y_i, z_i)$. This forms a family of convex piecewise quadratic bounds,

$\mathcal{F} = \Big\{ \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \Big( \max_{y, z} \big( w \cdot \phi(x_i, y, z) + \Delta(y, y_i) \big) - w \cdot \phi(x_i, y_i, z_i) \Big) \;\Big|\; \forall i,\ z_i \in \mathcal{Z} \Big\}$.   (17)

The CCCP algorithm for LS-SVM selects the bound $b_t$ defined by
$z_i^t = \arg\max_{z_i} w_{t-1} \cdot \phi(x_i, y_i, z_i)$. This particular choice is a special case of G-MM
when $g(b, w) = -b(w)$.

To generate random bounds from $\mathcal{B}_t$ we use the same approach as in the case of k-means
clustering. We perform a random walk in a graph where the nodes are latent configurations
leading to valid bounds, and the edges connect latent configurations that differ in a single
latent variable.

Bound optimization: Optimization of a bound $b \in \mathcal{F}$ corresponds to a convex
quadratic program and can be solved using different techniques, including gradient based
methods (e.g. SGD) and the cutting-plane method (Joachims et al., 2009). We use the
cutting-plane method in our experiments.

4.2.1 Bias Function for Multi-fold MIL

The multi-fold MIL algorithm (Cinbis et al., 2016) was introduced for training latent SVMs
for weakly supervised object localization, to deal with stickiness issues in training with
CCCP. It modifies how latent variables are updated during training. Cinbis et al. (2016)
divide the training set into $K$ folds, and update the latent variables in each fold using a
model trained on the other $K - 1$ folds. This algorithm does not have a formal convergence
guarantee. By defining a suitable bias function, we can derive a G-MM algorithm that
mimics the behavior of multi-fold MIL, and yet, is convergent.

Consider training an LS-SVM. Let $S = (1, \dots, n)$ be an ordered sequence of training
sample indices. Also, let $z_i \in \mathcal{Z}$ denote the latent variable associated to training example
$(x_i, y_i)$. Let $I$ denote a subsequence of $S$. Also, let $z_I^t$ denote the fixed latent variable
values of training examples in $I$ in iteration $t$, and let $w(I, z_I^t)$ be the model trained on
$\{(x_i, y_i) \mid i \in I\}$ with latent variables fixed to $z_I^t$ in the last maximization of (16).
We assume access to a loss function $\ell(w, x, y, z)$. For example, for the binary latent
SVM where $y \in \{-1, 1\}$, $\ell$ is the hinge loss: $\ell(w, x, y, z) = \max\{0, 1 - y\, w \cdot \phi(x, z)\}$.

We start by considering the Leave-One-Out (LOO) setting, i.e. $K = n$, and call the
algorithm of Cinbis et al. (2016) LOO-MIL in this case. The update rule of LOO-MIL in
iteration $t$ is to set

$z_i^t = \arg\min_{z \in \mathcal{Z}} \ell\big( w(S \setminus i,\ z_{S \setminus i}^{t-1}),\ x_i,\ y_i,\ z \big), \quad \forall i \in S$.   (18)

After updating the latent values for all training examples, the model $w$ is retrained by
optimizing the resulting bound.

Now let us derive the bias function for a G-MM algorithm that mimics the behavior of
LOO-MIL. Recall from Equation 17 that each bound $b \in \mathcal{B}_t$ is associated with a joint latent
configuration $z(b) = (z_1, \dots, z_n)$. We use the following bias function:

$g(b, w) = - \sum_{i \in S} \ell\big( w(S \setminus i,\ z_{S \setminus i}^{t-1}),\ x_i,\ y_i,\ z_i \big)$.   (19)

Note that picking a bound according to (19) is equivalent to the LOO-MIL update rule of
(18) except that in (19) only valid bounds are considered; that is, bounds that make at least
$\eta$-progress.
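
The relationship between the bias function and the LOO-MIL update can be illustrated on a tiny example. The sketch below is hypothetical throughout: it takes the leave-one-out models as given vectors instead of training them, uses $\phi(x, z) = x[z]$ as a stand-in feature map, and searches all joint configurations by brute force, which is only feasible for toy sizes. The `is_valid` predicate stands in for membership in $\mathcal{B}_t$.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(w, x, y, z):
    # Binary latent SVM hinge loss, with phi(x, z) taken to be x[z]
    return max(0.0, 1.0 - y * dot(w, x[z]))


def loo_bias(z_config, data, loo_models, loss=hinge_loss):
    """Negative sum of held-out losses for a joint latent configuration."""
    return -sum(
        loss(loo_models[i], x, y, z_config[i])
        for i, (x, y) in enumerate(data)
    )


def best_config(data, loo_models, latents, is_valid=lambda z: True):
    """Brute-force argmax of the bias over valid joint configurations."""
    candidates = [
        z for z in product(latents, repeat=len(data)) if is_valid(z)
    ]
    return max(candidates, key=lambda z: loo_bias(z, data, loo_models))


# Illustrative data: each x holds one 2-D feature per latent location.
data = [([(1.0, 0.0), (0.0, 1.0)], 1), ([(2.0, 0.0), (0.0, -1.0)], -1)]
loo_models = [(1.0, -1.0), (0.5, 0.5)]      # assumed pre-trained held-out models
z_best = best_config(data, loo_models, latents=[0, 1])
g_best = loo_bias(z_best, data, loo_models)
```

When every configuration is valid, the bias decomposes over examples, so its maximizer coincides with the per-example minimization over latent values. Restricting `is_valid` is what distinguishes the G-MM variant.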

For the general multi-fold case (i.e. K < n), the bias function can be derived similarly. 

5. Experiments 

We evaluate G-MM and MM algorithms on k-means clustering and LS-SVM training on
various datasets. Recall from (4) that the progress coefficient $\eta$ defines the set of valid
bounds $\mathcal{B}_t$ in each step. CCCP and standard k-means bounds correspond to setting $\eta = 1$,
thus taking maximally large steps towards a local minimum of the true objective.

5.1 k-means Clustering

We conduct experiments on four different clustering datasets: Norm-25 (Arthur and
Vassilvitskii, 2007), D31 (Veenman et al., 2002), Cloud (Arthur and Vassilvitskii, 2007), and
GMM-200. Norm-25, D31, and GMM-200 are synthetic and Cloud is from real data. See
the references for details about the datasets. GMM-200 was created by us. It is a Gaussian
mixture model on 2-D data with 200 mixture components. Each component is a Gaussian
distribution with $\sigma = 1.0$ and mean $\mu$ on a square of size $70 \times 70$. The means are placed at least
$2.5\sigma$ apart from each other. The dataset contains 50 samples per mixture component.
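
A dataset with the stated properties could be generated along the following lines. This is a sketch under assumptions, not the procedure used to create GMM-200: drawing the means uniformly with rejection sampling, using an isotropic covariance, and placing the square at $[0, 70]^2$ are all choices made for illustration, and the function name and seed are hypothetical.

```python
import math
import random


def make_gmm_dataset(n_components=200, side=70.0, sigma=1.0,
                     min_sep=2.5, per_component=50, seed=0):
    """Sample means on a square with pairwise distance >= min_sep * sigma,
    then draw per_component isotropic Gaussian samples around each mean."""
    rng = random.Random(seed)
    means = []
    while len(means) < n_components:
        cand = (rng.uniform(0.0, side), rng.uniform(0.0, side))
        # Rejection step: keep the candidate only if it is far from all means
        if all(math.hypot(cand[0] - m[0], cand[1] - m[1]) >= min_sep * sigma
               for m in means):
            means.append(cand)
    points = [
        (rng.gauss(mx, sigma), rng.gauss(my, sigma))
        for mx, my in means
        for _ in range(per_component)
    ]
    return means, points


means, points = make_gmm_dataset()
```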

We compare results from three different initializations: forgy selects $k$ training
examples uniformly at random without replacement to define initial cluster centers, random
partition assigns training samples to cluster centers randomly, and k-means++ uses the
algorithm in (Arthur and Vassilvitskii, 2007). In each experiment we run the algorithm 50
times and report the mean, standard deviation, and the best objective value (Equation 12).
Table 1 shows the results using k-means (hard-EM) and G-MM. We note that the variance
of the solutions found by G-MM is typically smaller than k-means. Moreover, the best and
the average solutions found by G-MM are always better than (or the same as) those found
                               forgy                    random partition           k-means++
dataset    k     opt. method   avg ± std       best     avg ± std         best     avg ± std       best
Norm-25    25    hard-EM       1.9e5 ± 2e5     7.0e4    5.8e5 ± 3e5       2.2e5    5.3e3 ± 9e3     1.5
Norm-25    25    G-MM          9.7e3 ± 1e4     1.5      2.0e4 ± 0         2.0e4    4.5e3 ± 8e3     1.5
D31        31    hard-EM       1.69 ± 0.03     1.21     52.61 ± 47.06     4.00     1.55 ± 0.17     1.10
D31        31    G-MM          1.43 ± 0.15     1.10     1.21 ± 0.05       1.10     1.45 ± 0.14     1.10
Cloud      50    hard-EM       1929 ± 429      1293     44453 ± 88341     3026     1237 ± 92       1117
Cloud      50    G-MM          1465 ± 43       1246     1470 ± 8          1444     1162 ± 95       1067
GMM-200    200   hard-EM       2.25 ± 0.10     2.07     11.20 ± 0.63      9.77     2.12 ± 0.07     1.99
GMM-200    200   G-MM          2.04 ± 0.09     1.90     1.85 ± 0.02       1.80     1.98 ± 0.06     1.89


Table 1: Comparison of G-MM and k-means (hard-EM) on multiple clustering datasets.
Three different initialization methods were compared: forgy initializes cluster
centers to random examples, random partition assigns each data point to a
random cluster center, and k-means++ implements the algorithm from (Arthur
and Vassilvitskii, 2007). The mean, standard deviation, and best objective values
out of 50 random trials are reported. Both k-means and G-MM use the exact same
initialization in each trial. G-MM consistently converges to better solutions.


by k-means. This trend generalizes over different initialization schemes as well as different
datasets.

Although random partition seems to be a very bad initialization for k-means on all
datasets, G-MM recovers from it. In fact, on the D31 and GMM-200 datasets, G-MM
initialized by random partition performs better than when it is initialized by other methods
(including k-means++). Also, the variance of the best solutions (across different
initialization methods) in G-MM is smaller than that of k-means. These observations suggest
that the G-MM optimization is less sticky to initialization than k-means.

Figure 2 visualizes the result of k-means and G-MM (with random bounds) on the D31
dataset (Veenman et al., 2002), from the same initialization. G-MM finds a near perfect
solution, while in k-means many clusters get merged incorrectly or die off. Dead clusters
are those which do not get any points assigned to them. The update rule of Equation (14)
collapses the dead clusters onto the origin.

Figure 3 shows the effect of the progress coefficient on the quality of the solution found
by G-MM. Different colors correspond to different initialization schemes. The solid line
indicates the average objective over 50 trials, the shaded area covers one standard
deviation from the average, and the dashed line indicates the best solution over the 50
trials. Smaller progress coefficients allow for more extensive exploration, and hence, smaller
variance in the quality of the solutions. On the other hand, when the progress coefficient
is large G-MM is more sensitive to initialization (i.e. is more sticky) and, thus, the quality
of the solutions over multiple runs is more diverse. However, despite the greater diversity,
the best solution is worse when the progress coefficient is large. G-MM reduces to k-means
if we set the progress coefficient to 1 (i.e. the largest possible value).
Figure 2: Visualization of clustering solutions on the D31 dataset (Veenman et al., 2002)
from identical initializations. Random partition initialization scheme is used. (a)
color-coded ground-truth clusters, (b) solution of k-means, (c) solution of G-MM.
The white crosses indicate location of the cluster centers. Color codes match up
to a permutation.



Figure 3: Effect of the progress coefficient $\eta$ (x-axis) on the quality of the solutions found
by G-MM (y-axis) on two clustering datasets. The quality is measured by the
objective function in (12). Lower values are better. The average (solid line), the
best (dashed line), and the variance (shaded area) over 50 trials are shown in the
plots and different initializations are color coded.


5.2 Latent Structural SVM for Image Classification and Object Detection 

We consider the problem of training an LS-SVM classifier on the mammals dataset (Heitz
et al., 2009). The dataset contains images of six mammal categories with image-level
annotation. Locations of the objects in these images are not provided, and are therefore treated
as latent variables in the model. Specifically, let $x$ be an image and $y$ be a class label
($y \in \{1, \dots, 6\}$ in this case), and let $z$ be the latent location of the object in the image.
We define $\phi(x, y, z)$ to be a feature function with 6 blocks; one block for each category. It
extracts features from location $z$ of image $x$ and places them in the $y$-th block of the output
                center                        top-left                      random
Opt. Method     objective      test error     objective      test error     objective      test error
CCCP            1.21 ± 0.03    22.9 ± 9.7     1.35 ± 0.03    42.5 ± 4.6     1.47 ± 0.03    31.8 ± 2.6
G-MM random     0.79 ± 0.03    17.5 ± 3.9     0.91 ± 0.02    31.4 ± 10.1    0.85 ± 0.03    19.6 ± 9.2
G-MM biased     0.64 ± 0.02    16.8 ± 3.2     0.70 ± 0.02    18.9 ± 5.0     0.65 ± 0.02    14.6 ± 5.4


Table 2: LS-SVM results on the mammals dataset (Heitz et al., 2009). We report the mean
and standard deviation of the training objective (Equation 16) and test error
over five folds. Three strategies for initializing latent object locations are tried:
image center, top-left corner, and random location. “G-MM random” uses random
bounds, and “G-MM biased” uses a bias function inspired by multi-fold MIL (Cinbis
et al., 2016). Both variants consistently and significantly outperform the CCCP
baseline.


and fills the rest with zero. We use the following multi-class classification rule:

$y(x) = \arg\max_{y, z} w \cdot \phi(x, y, z), \quad w = (w_1, \dots, w_6)$.   (20)

In this experiment we use a setup similar to that in (Kumar et al., 2012): we use Histogram
of Oriented Gradients (HOG) for the image feature $\phi$, and the 0-1 classification loss for
$\Delta$. We set $\lambda = 0.4$ in Equation 16. We report 5-fold cross-validation performance. Three
initialization strategies are considered for the latent object locations: image center, top-left
corner, and random locations. The first is a reasonable initialization since most objects are
at the center in this dataset; the second initialization strategy is somewhat adversarial.

We try a stochastic as well as a deterministic bound construction method. For the
stochastic method, in each iteration $t$ we uniformly sample a subset of examples $S_t$ from
the training set, and update their latent variables using $z_i^t = \arg\max_{z_i} w_{t-1} \cdot \phi(x_i, y_i, z_i)$.
Other latent variables are kept the same as in the previous iteration. We increase the size of
$S_t$ across iterations.
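
The stochastic update described above can be sketched as follows. This is a minimal illustration under stated assumptions: `features[i]` is a hypothetical array whose row `z` holds the feature vector of example `i` at latent location `z`, the subset sizes are arbitrary, and the weight vector is held fixed for brevity, whereas in actual training it would be re-optimized on the new bound after every update.

```python
import numpy as np


def stochastic_latent_update(w, features, z_prev, subset_size, rng):
    """Re-impute latent variables for a uniformly sampled subset of examples.

    features[i] has shape (num_locations, dim); row z is assumed to hold
    phi(x_i, y_i, z). Examples outside the subset keep their previous value.
    """
    n = len(features)
    z_new = z_prev.copy()
    subset = rng.choice(n, size=min(subset_size, n), replace=False)
    for i in subset:
        scores = features[i] @ w          # w . phi(x_i, y_i, z) for every z
        z_new[i] = int(np.argmax(scores))
    return z_new, subset


rng = np.random.default_rng(0)
n, num_locations, dim = 8, 5, 3           # hypothetical toy sizes
features = [rng.normal(size=(num_locations, dim)) for _ in range(n)]
w = rng.normal(size=dim)
z = np.zeros(n, dtype=int)                # every example starts at location 0

for size in [2, 4, 8]:                    # the sampled subset grows across iterations
    z, subset = stochastic_latent_update(w, features, z, size, rng)
```

Updating only a subset means the resulting bound need not touch the objective at the current solution, which is exactly the freedom that G-MM allows and MM does not.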

For the deterministic method, we use the bias function that we described in Section 4.2.1.
This is inspired by the multi-fold MIL idea (Cinbis et al., 2016) and is shown to reduce
stickiness to initialization, especially in high dimensions. We set the number of folds to
$K = 10$ in our experiments.

Table 2 shows results on the mammals dataset. Both variants of G-MM consistently 
outperform CCCP in terms of training objective and test error. We observed that CCCP
rarely updates the latent locations, under all initializations. On the other hand, both
variants of G-MM significantly alter the latent locations, thereby avoiding the local minima
close to the initialization. Figure 4 visualizes this for top-left initialization. Since objects
rarely occur at the top-left corner in the mammals dataset, a good model is expected to
significantly update the latent locations. Averaged over five cross-validation folds, about
90% of the latent variables were updated in G-MM after training whereas this measure was
2.4% for CCCP. This is consistent with the better training objectives and test errors of
G-MM. 

We provide additional experimental results on the mammals dataset. Figure 5 shows 
example training images and the final imputed latent object locations by three algorithms: 
CCCP (red), G-MM random (blue), and G-MM biased (green). The initialization is top-left.

Figure 4: Latent location changes after learning, in relative image coordinates, for all five
cross-validation folds, for the top-left initialization on the mammals dataset. Left
to right: CCCP, “G-MM random”, “G-MM biased” ($K = 10$). Each cross represents
a training image; cross-validation folds are color coded differently. Averaged
over five folds, CCCP only alters 2.4% of all latent locations, leading to very bad
performance. “G-MM random” and “G-MM biased” alter 86.2% and 93.6% on
average, respectively, and perform much better.


In most cases CCCP fails to update the latent locations given by initialization. The 
two G-MM variants, however, are able to update them significantly and often localize the 
objects in training images correctly. This is achieved only with image-level object category 
annotations, and with a very bad (even adversarial) initialization. 

5.3 Latent Structural SVM for Scene Recognition 

We implement the reconfigurable model of Parizi et al. (2012) to do scene classification on
the MIT-Indoor dataset (Quattoni and Torralba, 2009), which has images from 67 indoor scene
categories. We segment each image into a 10x10 regular grid and treat the grid cells as
image regions. We train a model with 200 shared parts. All parts can be used to describe
the data in an image region. We use the pre-trained hybrid ConvNet from Zhou et al. (2014)
to extract features from image regions. We record from the 4096 neurons at the penultimate
layer of the network and use PCA to reduce the dimensionality of these features to 240.

The reconfigurable model is an instance of LS-SVM models. The latent variables are the 
assignments of parts to image regions and the output structure is the multi-valued category 
label predictions. LS-SVMs are known to be sensitive to initialization (a.k.a. the stickiness 
issue). Parizi et al. (2012) cope with this issue by training a generative version of the model 
first and using it to initialize the discriminative model. Generative models are typically 
less sticky but perform worse in practice. To validate the hypothesis regarding stickiness of 
LS-SVMs we run training using several initialization strategies. 

Initializing training entails the assignment of parts to image regions, i.e. setting the $z_i$'s in
Equation 17 to define the first bound. To this end we first discover 200 parts that capture
discriminative features in the training data. We then run graph cut on each training image 
to obtain part assignments to image regions. Each cell in the 10x10 image grid is a node 
in the graph. Two nodes in the graph are connected if their corresponding cells in the 
image grid are next to each other. Unary terms in the graph cut are the dot product 
scores between the feature vector extracted from an image region and a part filter plus the 
corresponding region-to-part assignment score. Pairwise terms in the graph cut implement 

Figure 5: Example training images from the mammals dataset, shown with final imputed
latent object locations by three algorithms: CCCP (red), G-MM random (blue),
G-MM biased (green). Initialization: top-left.

                 Random                 λ = 0.00          λ = 0.25          λ = 0.50          λ = 1.00
Opt. Method      Acc.% ± std   O.F.     Acc.%    O.F.     Acc.%    O.F.     Acc.%    O.F.     Acc.%    O.F.
CCCP             41.94 ± 1.1   15.20    40.88    14.81    43.99    14.77    45.60    14.72    46.62    14.70
G-MM random      47.51 ± 0.7   14.89    43.38    14.71    44.41    14.70    47.12    14.66    49.88    14.58
G-MM biased      49.34 ± 0.9   14.55    44.83    14.63    48.07    14.51    53.68    14.33    56.03    14.32

Table 3: Performance of LS-SVM trained with CCCP and G-MM on the MIT-Indoor dataset.
We report classification accuracy (Acc.%) and the training objective value (O.F.).
Columns correspond to different initialization schemes. “Random” assigns random
parts to regions. $\lambda$ controls the coherency of the initial part assignments: $\lambda = 1$
($\lambda = 0$) corresponds to the most (the least) coherent case. “G-MM random” uses
random bounds and “G-MM biased” uses the bias function of Equation 21; $\eta = 0.1$
in all the experiments. Coherent initializations lead to better models in general,
but they require discovering good initial parts. G-MM outperforms CCCP,
especially with random initialization. “G-MM biased” performs the best.


a Potts model that encourages coherent labelings. Specifically, the penalty of labeling two
neighboring nodes differently is $\lambda$ and it is zero otherwise. $\lambda$ controls the coherency of the
initial assignments. We experiment using $\lambda \in \{0, 0.25, 0.5, 1\}$. We also experiment with
random initialization, which corresponds to assigning the $z_i$'s randomly. This is the simplest
form of initialization and does not require discovering initial part filters.

We do G-MM optimization using both random and biased bounds. For the latter we
use a bias function $g(b, w)$ that measures coherence of the labeling from which the bound
was constructed. Recall from Equation 17 that each bound $b \in \mathcal{B}_t$ corresponds to
a labeling of the image regions. We denote the labeling corresponding to the bound $b$
by $z(b) = (z_1, \dots, z_n)$ where $z_i = (z_{i,1}, \dots, z_{i,100})$ specifies part assignments for all the
100 regions in the $i$-th image. Also, let $E(z_i)$ denote a function that measures coherence
of the labeling $z_i$. In fact, $E(z_i)$ is the Potts energy function on a graph whose nodes
are $z_{i,1}, \dots, z_{i,100}$. The graph respects a 4-connected neighborhood system (recall that $z_{i,r}$
corresponds to the $r$-th cell in the 10x10 grid defined on the $i$-th image). If two neighboring
nodes $z_{i,r}$ and $z_{i,s}$ get different labels the energy $E(z_i)$ increases by 1. For biased bounds
we use the following bias function which favors bounds that correspond to more coherent
labelings:

$$g(b, w) = -\sum_{i=1}^{n} E(z_i), \qquad z(b) = (z_1, \dots, z_n). \tag{21}$$
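
To make the bias function concrete, the Potts energy on a 4-connected grid and the resulting bias value can be computed as in the sketch below. The function names `potts_energy` and `coherence_bias`, and the three toy labelings, are assumptions introduced for illustration; each labeling is stored as a 10x10 integer array of part indices.

```python
import numpy as np


def potts_energy(labels):
    """Number of 4-connected neighboring cell pairs that carry different labels."""
    labels = np.asarray(labels)
    horizontal = np.count_nonzero(labels[:, 1:] != labels[:, :-1])
    vertical = np.count_nonzero(labels[1:, :] != labels[:-1, :])
    return int(horizontal + vertical)


def coherence_bias(labelings):
    """Bias of Equation 21: minus the Potts energy summed over all images."""
    return -sum(potts_energy(z_i) for z_i in labelings)


uniform = np.zeros((10, 10), dtype=int)            # one part everywhere
halves = np.zeros((10, 10), dtype=int)
halves[:, 5:] = 1                                  # left half part 0, right half part 1
checker = np.indices((10, 10)).sum(axis=0) % 2     # checkerboard of two parts

print(potts_energy(uniform))   # 0
print(potts_energy(halves))    # 10
print(potts_energy(checker))   # 180
print(coherence_bias([uniform, halves, checker]))  # -190
```

The fully coherent labeling gets the highest bias (zero), while the checkerboard, in which all 180 neighboring pairs disagree, gets the lowest, so bounds built from coherent labelings are preferred.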

Table 3 compares performance of models trained using CCCP and G-MM with random 
and biased bounds. For G-MM with random bounds we repeat the experiment five times 
and report the average over these five trials. Also, for random initialization, we do five trials 
using different random seeds and report the mean and standard deviation of the results. 
G-MM does better than CCCP under all initializations. It also converges to a solution with 
lower training objective value than CCCP. Our results show that picking bounds uniformly 
at random from the set of valid bounds is slightly (but consistently) better than committing 
to the CCCP bound. We get a remarkable boost in performance when we use a reasonable 
prior over bounds (i.e. the bias function of Equation 21). With $\lambda = 1$, CCCP attains

experiment                    setup          MM               G-MM random      G-MM biased
scene recognition             λ = 0.0        145              107              87
scene recognition             λ = 1.0        65               69               138
data clustering (GMM-200)     forgy          35.76 ± 7.8      91.52 ± 4.4      -
data clustering (GMM-200)     rand. part.    114.98 ± 12.9    241.89 ± 2.1     -
data clustering (GMM-200)     k-means++      32.92 ± 5.8      80.78 ± 2.9      -
data clustering (Cloud)       forgy          37.18 ± 12.1     87.68 ± 15.4     -
data clustering (Cloud)       rand. part.    65.14 ± 18.7     138.64 ± 5.9     -
data clustering (Cloud)       k-means++      21.3 ± 4.1       44.12 ± 10.7     -

Table 4: Comparison of the number of iterations it takes for MM and G-MM to converge
in the scene recognition and the data clustering experiments with different initializations
and/or datasets. The numbers reported for the clustering experiment are
the average and standard deviation over 50 trials.

accuracy of 46.6%, whereas G-MM attains 49.9% and 56.0% accuracy with random and
biased bounds, respectively. Moreover, G-MM is less sensitive to initialization.

5.4 Running Time 

G-MM bounds make a fraction of the progress made in each bound construction step compared
to the MM bound. Therefore, we would expect G-MM to take more steps before it
converges. We report the number of iterations that MM and G-MM take to converge in
Table 4. The results for G-MM depend on the value of the progress coefficient $\eta$, which is
set to match the experiments in the paper: $\eta = 0.02$ for the clustering experiment (Section
5.1) and $\eta = 0.10$ for the scene recognition experiment (Section 5.3).

The overhead of the bound construction step depends on the application. For example, 
in the scene recognition experiment, optimizing the bounds takes orders of magnitude more
time than sampling them (a couple of hours vs. a few seconds). In the clustering experiment,
however, the optimization step is solved in closed form whereas sampling a bound involves 
performing a random walk on a large graph which can take a couple of minutes to run. 

6. Conclusion 

We have introduced Generalized Majorization-Minimization (G-MM), a generic iterative
bound optimization framework that generalizes Majorization-Minimization (MM).
Our key observation is that MM enforces an overly-restrictive touching constraint in its
bound construction mechanism, which is inflexible and can lead to sensitivity to initialization.
By adopting a different measure of progress, in G-MM, this constraint is relaxed to
allow for more freedom in bound construction. Specifically, we propose deterministic and
stochastic ways of selecting bounds from a set of valid ones. This generalized bound construction
process tends to be less sensitive to initialization, and enjoys the ability to directly
incorporate rich application-specific priors and constraints, without modifications to the objective
function. In experiments with several latent variable models, G-MM algorithms are
shown to significantly outperform their MM counterparts.

Future work includes applying G-MM to a wider range of problems and theoretical analysis,
such as convergence rate. We also note that, although G-MM is more conservative than
MM in moving towards nearby local minima, it still requires making progress in every step.
Another interesting research direction is to enable G-MM to occasionally pick bounds that
do not make progress with respect to the solution of the previous bound, thereby making
it possible to get out of local minima, while still maintaining the convergence guarantees of
the method.

Appendix A. Proof of Theorem 3 

We have proved two theorems in the main paper. In Theorem 1, we showed that $\lim_{t \to \infty} d_t = 0$,
where $d_t = b_t(w_t) - F(w_t)$ is the gap between the G-MM bound and the objective function
at iteration $t$ at the minimizer of the bound. In Theorem 2, we showed that G-MM converges
to a solution, i.e. $\exists w^\dagger$ s.t. $\lim_{t \to \infty} w_t = w^\dagger$.

In this appendix we show that under mild conditions, our G-MM framework
converges to a local extremum or a saddle point of the objective function. Theorem 3 roughly
states that, if $F$ is continuously differentiable, and the family of bounds are smooth and
have bounded curvature, then G-MM converges to a stationary point; that is, $\nabla F(w^\dagger) = 0$.

Proof. We first sketch the proof before explaining the steps in detail.
We prove the theorem by contradiction by showing that if $\nabla F(w^\dagger) \neq 0$ then we can construct
a lower bound on $F$ (stated in (29)) that, at some point $w$, goes above an upper bound
on $b_t$ (stated in (33)). This implies that $\exists w$ s.t. $F(w) > b_t(w)$, which contradicts the
assumption that $b_t$ is an upper bound of $F$. The proof has three steps: 1) constructing the
lower bound on $F$, 2) constructing the upper bound on $b_t$, and 3) finding a point $w$ s.t.
$F(w) > b_t(w)$. Details are presented below.

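Step 1: Assume $\|\nabla F(w^\dagger)\|_2 \neq 0$. This can be stated as

$$\exists \lambda > 0 \ \text{s.t.} \ \|\nabla F(w^\dagger)\|_2 = 2\lambda. \tag{22}$$

Consider the value of $F$ along the gradient direction at $w^\dagger$. We denote this by $g(z)$ where $z \in \mathbb{R}$
and define it as

$$g(z) = F(w^\dagger + zu), \qquad u = \frac{\nabla F(w^\dagger)}{\|\nabla F(w^\dagger)\|_2}. \tag{23}$$

In what follows, we only consider the case where $z > 0$. We can write the following equalities:

$$g'(z) = \frac{\partial g}{\partial z}(z) = \nabla F(w^\dagger + zu) \cdot u \tag{24}$$

$$g'(0) = \|\nabla F(w^\dagger)\|_2 = 2\lambda \tag{25}$$

$$g(0) = F(w^\dagger). \tag{26}$$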

In Figure 6a we show a non-convex function $F$ in $\mathbb{R}^2$. The gradient direction at a
particular point $p$ is also shown in the figure. Figure 6b shows the same function from the
top view, together with the gradient direction at $p$. Figure 6c shows how the value of $F$

Figure 6: An example of a non-convex function $F$ on $\mathbb{R}^2$. (a) shows the plot of $F$; the value
of the function on the line that touches $F$ at the point $p = (0.3, -0.2)$ is marked
by the black and white line. (b) shows $F$ from the top view; it also marks the
point $p$ with the red circle and shows the direction of the gradient vector at $p$.
The length of the vector is set arbitrarily for visualization purposes. (c) shows
the value of $F$ along the tangent line as a one dimensional function (see Equation
23); it also shows the gradient direction at $z = 0$ (in red) and the linear function
of Equation 29 that lower bounds $F$ (in green).


changes along the tangent line as a function of the step size $z$ along the gradient direction
$u$ (see $g(z)$ in Equation 23).

Since $F$ is differentiable, we can use the mean value theorem to rewrite $g$ for $z > 0$ as

$$\forall z > 0, \ \exists k \in [0, z] \ \text{s.t.} \ g(z) = g(0) + g'(k) z. \tag{27}$$

Further, since we assume $F$ to be continuously differentiable, so is $g$. Continuity of $g'$,
together with $g'(0) = 2\lambda > \lambda$, implies that in a small neighborhood around 0, $g'$ is larger than $\lambda$:

$$\exists \delta > 0 \ \text{s.t.} \ \forall z \in [0, \delta], \ g'(z) > \lambda. \tag{28}$$

We can use this inequality in (27) and get a lower bound on $g$ in the neighborhood of 0, or
equivalently, a lower bound on $F$ in the vicinity of $w^\dagger$ and in the direction of $u$:

$$\exists \delta > 0, \ \forall z \in [0, \delta], \ g(z) > g(0) + \lambda z$$
$$\Rightarrow \ \exists \delta > 0, \ \forall z \in [0, \delta], \ F(w^\dagger + zu) > F(w^\dagger) + \lambda z. \tag{29}$$

This lower bound is shown in Figure 6c in green.
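
Step 1 can be checked numerically on a toy problem. The sketch below is only a sanity check under assumptions made here: the objective `F` is a hypothetical smooth non-convex function chosen for illustration (it is not the function plotted in Figure 6), `w_dagger` is simply a point with non-zero gradient, and `delta = 0.05` is a hand-picked neighborhood size. It verifies that the directional derivative at zero equals the gradient norm, and that the linear function with slope $\lambda$ (half the gradient norm) lower bounds the objective along the gradient direction.

```python
import numpy as np


def F(w):
    # Hypothetical smooth non-convex objective, chosen only for illustration.
    return np.sin(3.0 * w[0]) * np.cos(2.0 * w[1]) + 0.5 * (w[0] ** 2 + w[1] ** 2)


def grad_F(w):
    # Analytic gradient of the toy objective above.
    return np.array([
        3.0 * np.cos(3.0 * w[0]) * np.cos(2.0 * w[1]) + w[0],
        -2.0 * np.sin(3.0 * w[0]) * np.sin(2.0 * w[1]) + w[1],
    ])


w_dagger = np.array([0.3, -0.2])       # a point where the gradient is non-zero
grad_norm = np.linalg.norm(grad_F(w_dagger))
u = grad_F(w_dagger) / grad_norm       # unit vector along the gradient
lam = grad_norm / 2.0                  # so that the gradient norm equals 2 * lam


def g(z):
    """Value of F along the gradient direction, as in the definition of g."""
    return F(w_dagger + z * u)


# The directional derivative at zero should match the gradient norm.
h = 1e-6
g_prime_0 = (g(h) - g(-h)) / (2.0 * h)

# The line with slope lam should stay below g on a small interval [0, delta].
delta = 0.05
zs = np.linspace(0.0, delta, 51)
lower_bound_holds = all(g(z) >= g(0.0) + lam * z for z in zs)

print(abs(g_prime_0 - grad_norm) < 1e-6, lower_bound_holds)
```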

Step 2: To get an upper bound on $b_t$, we note that $\nabla b_t(w_t) = 0$, since $w_t = \arg\min_w b_t(w)$
is the minimizer of $b_t$ and $b_t$ is differentiable. From this and the assumption that the curvature
of all bounds in all directions is finite (i.e. $\forall b \in \mathcal{F}, \nabla^2 b(w) \preceq MI$) we can upper
bound $b_t$ with a quadratic function with curvature $M$ whose minimum occurs at $w = w_t$
and has value $b_t(w_t)$, as follows:

$$b_t(w) \leq (w - w_t)^T M I (w - w_t) + b_t(w_t), \quad \forall w. \tag{30}$$

Similar to step 1, we only focus on the points $w$ that lie on the line $w^\dagger + zu$ and get the
following bound on $b_t$:

$$b_t(w^\dagger + zu) \leq M \|w^\dagger - w_t + zu\|_2^2 + b_t(w_t) \tag{31}$$
$$\leq 2M \|w^\dagger - w_t\|_2^2 + 2Mz^2 + b_t(w_t), \quad \forall z \in \mathbb{R}. \tag{32}$$

Finally, from Theorem 2 we have that $\forall \epsilon_1 > 0, \exists T_1$ s.t. $\forall t > T_1, \|w^\dagger - w_t\|_2 < \epsilon_1$. We can
use this in (32) to get the following strict upper bound:

$$b_t(w^\dagger + zu) < 2M\epsilon_1^2 + 2Mz^2 + b_t(w_t), \quad \forall t > T_1, \ \forall z \in \mathbb{R}. \tag{33}$$
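
The passage from (31) to (32) is not spelled out in the text. One standard way to obtain it, written out here as a reading aid with the shorthand $a = w^\dagger - w_t$ (a symbol introduced only for this derivation), uses the Cauchy-Schwarz inequality, the elementary bound $2xy \leq x^2 + y^2$, and the fact that $u$ is a unit vector:

```latex
\begin{align*}
\|a + zu\|_2^2
  &= \|a\|_2^2 + 2z\,(a \cdot u) + z^2 \|u\|_2^2 \\
  &\leq \|a\|_2^2 + 2\,\|a\|_2\,|z| + z^2
     && \text{(Cauchy-Schwarz, } \|u\|_2 = 1\text{)} \\
  &\leq \|a\|_2^2 + \left(\|a\|_2^2 + z^2\right) + z^2
     && \text{(} 2xy \leq x^2 + y^2 \text{)} \\
  &= 2\,\|a\|_2^2 + 2z^2 .
\end{align*}
```

Multiplying by $M > 0$ and adding $b_t(w_t)$ gives the right hand side of (32).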

Step 3: We show that $\exists z \in [0, \delta]$ for which the right hand side of (29) is larger than the
right hand side of (33), which leads to the contradiction that for $w = w^\dagger + zu$, $F(w) > b_t(w)$.
In other words, we need to find $z \in [0, \delta]$ such that $F(w^\dagger) + \lambda z > 2M\epsilon_1^2 + 2Mz^2 + b_t(w_t)$, or

$$2Mz^2 - \lambda z + 2M\epsilon_1^2 + b_t(w_t) - F(w^\dagger) < 0. \tag{34}$$

Lemma 1. $\forall \epsilon_2 > 0, \exists T_2$ s.t. $\forall t > T_2, \ b_t(w_t) - F(w^\dagger) < \epsilon_2$.

Proof. To prove the lemma we first show that

$$\lim_{t \to \infty} |b_t(w_t) - F(w^\dagger)| = 0, \tag{35}$$

and since $b_1(w_1), b_2(w_2), \dots$ is a non-increasing sequence, $\forall t, \ b_t(w_t) \geq F(w^\dagger)$, thus we can
drop the absolute value.

To show (35), we can write

$$|b_t(w_t) - F(w^\dagger)| \leq |b_t(w_t) - F(w_t)| + |F(w_t) - F(w^\dagger)|.$$

The first term on the right hand side can get arbitrarily close to zero due to Theorem 1.
The second term can also get arbitrarily close to zero due to Theorem 2 and the fact that $F$ is
continuous (being continuously differentiable, it is locally Lipschitz). □

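Applying Lemma 1 to inequality (34), now we only need to show that

$$\exists z \in [0, \delta], \quad 2Mz^2 - \lambda z + 2M\epsilon_1^2 + \epsilon_2 < 0. \tag{36}$$

Note that $\epsilon_1$ and $\epsilon_2$ can be chosen arbitrarily. In particular, we can set them so that the
discriminant of the quadratic function in (36) is positive:

$$\Delta = \lambda^2 - 4(2M)(2M\epsilon_1^2 + \epsilon_2) > 0 \iff 2M\epsilon_1^2 + \epsilon_2 < \frac{\lambda^2}{8M}. \tag{37}$$

This guarantees that the quadratic function in (36) has two distinct roots. If $\epsilon_1 = \epsilon_2 = 0$,
then $2M\epsilon_1^2 + \epsilon_2 = 0$ and the two roots are $z_1 = 0$ and $z_2 = \frac{\lambda}{2M} > 0$. For $\epsilon_1, \epsilon_2 > 0$, only the
constant $2M\epsilon_1^2 + \epsilon_2$ is affected, which shifts the quadratic function up. We can now control
$\epsilon_1$ and $\epsilon_2$ to drive the constant arbitrarily close to 0, causing $z_1 > 0$ to get arbitrarily close
to zero, and thus fall within the interval $[0, \delta]$. Since the quadratic is strictly negative between
its two roots, any $z$ slightly larger than $z_1$ satisfies (36). Therefore (34) can be satisfied.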
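
The behavior of the roots can be illustrated numerically. In the sketch below the constants `M`, `lam` and `delta` are hypothetical values picked for illustration, and `c` stands for the constant term $2M\epsilon_1^2 + \epsilon_2$; the helper names are likewise introduced here. As `c` shrinks, the smaller root approaches zero, and a point of $[0, \delta]$ lying between the roots makes the quadratic strictly negative.

```python
import math


def quadratic_roots(M, lam, c):
    """Real roots of 2*M*z**2 - lam*z + c, or None if the discriminant is not positive."""
    disc = lam ** 2 - 8.0 * M * c
    if disc <= 0.0:
        return None
    root = math.sqrt(disc)
    return (lam - root) / (4.0 * M), (lam + root) / (4.0 * M)


def quadratic_value(z, M, lam, c):
    return 2.0 * M * z ** 2 - lam * z + c


M, lam, delta = 5.0, 1.0, 0.05   # hypothetical constants; lam**2 / (8*M) = 0.025
results = []
for c in [1e-2, 1e-3, 1e-4]:     # all below the threshold lam**2 / (8*M)
    z1, z2 = quadratic_roots(M, lam, c)
    # A point of [0, delta] that lies strictly between the roots for these constants.
    z = min(delta, 0.5 * (z1 + z2))
    results.append((c, z1, z, quadratic_value(z, M, lam, c)))

for c, z1, z, value in results:
    print(f"c={c:g}  z1={z1:.6f}  z={z:.3f}  quadratic(z)={value:.4f}")
```

With `c = 0` the roots are `0` and `lam / (2 * M)`, matching the values stated in the text, and for `c` above `lam**2 / (8 * M)` the function returns `None` because no real root exists.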

Figure 7: The quadratic function in the left hand side of (36). When $\epsilon_1 = \epsilon_2 = 0$, the
quadratic function has two roots, $z_1 = 0$ and $z_2 = \frac{\lambda}{2M} > 0$ (shown in blue).
When $\epsilon_1, \epsilon_2 > 0$ but properly chosen, the quadratic function is shifted up but still
has two distinct roots (shown in red).


To summarize, if we assume $\|\nabla F(w^\dagger)\|_2 = 2\lambda > 0$, then we have a lower bound on $F$
along the gradient direction $u$ at $w^\dagger$ (step 1): $\exists \delta > 0, \forall z \in [0, \delta], \ F(w^\dagger + zu) > F(w^\dagger) + \lambda z$.
Along the same direction, $b_t$ can be upper bounded for sufficiently large $t$ (step 2): $\forall \epsilon_1 > 0,
\exists T_1$ s.t. $\forall t > T_1, \ b_t(w^\dagger + zu) < 2M\epsilon_1^2 + 2Mz^2 + b_t(w_t)$. Finally, we can find $z_1 \in [0, \delta]$
such that $F(w^\dagger) + \lambda z_1 > 2M\epsilon_1^2 + 2Mz_1^2 + b_t(w_t)$ (step 3), leading to $F(w^\dagger + z_1 u) >
b_t(w^\dagger + z_1 u)$. However, this violates the assumption that $b_t$ is an upper bound of $F$, therefore
$\|\nabla F(w^\dagger)\|_2 = 0$. □